As a Public Relations Data Analyst, I constantly deal with vast amounts of scattered media monitoring data. Even though, for a short time now, PR agencies (and hopefully their clients) have not had high expectations regarding metrics and measurements. But I believe that one day, we will be compelled to gain a deeper understanding of our daily media monitoring datasets, uncovering hidden patterns that have never been discovered before.
At present, the purpose of media monitoring datasets is solely to report all online news articles based on our clients’ PR events. To ensure accountability, we must list the URL links related to each article.
But here’s the thing: We usually use the total number of articles to represent media coverage. However, if we include a dedicated column for media names in advance when building the dataset, we can easily track coverage by each media name later on. If we don’t, it could be challenging to monitor media coverage.
In such cases, we might need to manually list each media name based on the article’s URL after we have gathered all the headlines and URLs.
So, we got problems here:
We need such an automated workflow to do this task, especially to reduce human error when it comes to manual listing
We need to figure out how to track coverages as deep as possible
The DNS Hierarchy Example (src: educative.io)
sorry if I couldn’t
make this picture centered by default. it’s probably on my css params….
this long caption was made intentionally, hahaha
To gain a better understanding, I suggest that we can take a look at
this,
and this
documentation, please :)
By observing patterns in the domain/DNS hierarchy, I discovered an
approach that can provide us with valuable insights. We can effectively
utilize Regular Expressions (Regex) to remove http://,
https:// , and slashes (/) from each URL
entry. Using Regex, we can also separate all instances that come before
dots (.).
After cleaning the data with Regex, we can filter domain names and subdomains by executing several lines of code. I prefer using R-Tidyverse packages due to their higher-level functions and parameters.
In Public Relations Scope, all the subdomains might be
considered as “Media Desk” or “Media Region”. Taking The New York Times
(http://nytimes.com)
as an example, they have several media desks, namely:
cooking.nytimes.comparenting.nytimes.comadvertising.nytimes.comcityandsuburban.nytimes.com, and so on…Same here, the Indonesian media are tend to specify their subdomains based on media desks or district/city names. I fancy to take Tribunnews (http://tribunnews.com) as an example. We can find some media regions here, such as:
jabar.tribunnews.com (West Java)kaltim.tribunnews.com (East Borneo)jogja.tribunnews.com (Special Region of Yogyakarta),
and etc…To simplify the explanation of the data manipulation workflow, this portfolio relies heavily on Regex cleansing and Dplyr filtering. Regex cleansing helps us eliminate unwanted instances.
On the same importance, Dplyr is invaluable for putting ‘domain
name’ and ‘subdomains’ on a specified table, as my goal is to
retrieve and classify these two elements for a more in-depth analysis of
media desks.
Okayy, without further ado, let’s delve deeper!
http[s]://library(dplyr)
library(googlesheets4)
library(tidyr)
library(stringi)
library(stringr)
library(tidyr)
library(purrr)
library(janitor)I created a custom operator that functions as the negation of
%in%, which I call %out%. Essentially,
%out% can be used within dplyr::filter() to
exclude all the values specified in the filter()
function
# initiating %out%
`%out%` <- function(a,b) ! a %in% bgs4_deauth()
options(gargle_oauth_email = "gemforwork329@gmail.com")
dat_URL <-
read_sheet(
"https://docs.google.com/spreadsheets/d/1mKhCrBM5-uXZ_aRa72tCubgeSTZigipTG4DhLWr7t48/edit#gid=0"
)
dat_URLHere are the articles from our client. It was about a mobile phone
product launch and prelaunch. As we can see, we have gathered
4,222 different URLs from various media sources. Yep, just
the URLs.
url_list <- dat_URL %>%
select(URL) %>%
as.list() %>%
unlist()http[s]:// patterncleanurl3 <- gsub("http[s]?://","",url_list)
cleanurl3[1:5]#> URL1
#> "tekno.kompas.com/read/2021/08/11/21150047/harga-samsung-galaxy-z-fold-3-dan-z-flip-3-di-indonesia?page=all."
#> URL2
#> "www.idntimes.com/tech/gadget/alfonsus-adi-putra-2/produk-terbaru-yang-diluncurkan-dalam-samsung-galaxy-unpacked-2021-ini?q="
#> URL3
#> "www.antaranews.com/berita/2321406/samsung-luncurkan-galaxy-z-fold-3-dan-z-flip-3"
#> URL4
#> "www.jawapos.com/oto-dan-tekno/gadget/11/08/2021/samsung-rilis-ponsel-lipat-berikutnya-galaxy-z-fold-3-dan-z-flip-3-5g/"
#> URL5
#> "kumparan.com/kumparantech/samsung-galaxy-z-fold3-dan-z-flip3-resmi-rilis-ini-harganya-di-indonesia-1wJ1Ri6x0xL"
www. patterncleanurl4 <- gsub("www.","",cleanurl3)
cleanurl4[1:5]#> URL1
#> "tekno.kompas.com/read/2021/08/11/21150047/harga-samsung-galaxy-z-fold-3-dan-z-flip-3-di-indonesia?page=all."
#> URL2
#> "idntimes.com/tech/gadget/alfonsus-adi-putra-2/produk-terbaru-yang-diluncurkan-dalam-samsung-galaxy-unpacked-2021-ini?q="
#> URL3
#> "antaranews.com/berita/2321406/samsung-luncurkan-galaxy-z-fold-3-dan-z-flip-3"
#> URL4
#> "jawapos.com/oto-dan-tekno/gadget/11/08/2021/samsung-rilis-ponsel-lipat-berikutnya-galaxy-z-fold-3-dan-z-flip-3-5g/"
#> URL5
#> "kumparan.com/kumparantech/samsung-galaxy-z-fold3-dan-z-flip3-resmi-rilis-ini-harganya-di-indonesia-1wJ1Ri6x0xL"
(/)pre_result_regex <- sapply(strsplit(cleanurl4,"/"),"[",1)
pre_result_regex[1:5]#> URL1 URL2 URL3 URL4
#> "tekno.kompas.com" "idntimes.com" "antaranews.com" "jawapos.com"
#> URL5
#> "kumparan.com"
(.)result_regex <- as.list(strsplit(pre_result_regex, "\\."))
result_regex[1:5]#> $URL1
#> [1] "tekno" "kompas" "com"
#>
#> $URL2
#> [1] "idntimes" "com"
#>
#> $URL3
#> [1] "antaranews" "com"
#>
#> $URL4
#> [1] "jawapos" "com"
#>
#> $URL5
#> [1] "kumparan" "com"
This marks the end of the Regex cleansing process. From the
result_regex list/vector, we can convert it into a data
frame.
df_media <- as.data.frame(stringi::stri_list2matrix(result_regex, byrow = TRUE), stringsAsFactors=FALSE) ; head(df_media, 30)nrow(df_media)#> [1] 4222
names(df_media) <- paste0("instance", seq_along(df_media)) ; head(df_media, 20)I am interested in inspecting all the unique values in
instance2 column, assuming there are many domain names
inside. On this phase, analysts should open to explore all the
possibilities, as they plan to gain deeper EDA insights.
instance2_unique <- df_media %>%
select(instance2) %>%
distinct()
instance2_uniqueI have around one hundred values, and it’s apparent that there are
still “non-domain name” in the instance2 column (e.g., com,
id, co, web, my, net, etc), which should be identified as top-level
domains (TLDs). We have gained further insights from this data
pattern.
instance1 column?instance1_unique <- df_media %>%
select(instance1) %>%
distinct(); instance1_uniqueAfter inspecting both the instance1 and
instance2 columns, we can infer that, for the most part,
instance1 contains numerous domain names, such as kumparan,
beritasatu, cnnindonesia, and antaranews.
I am confident that I can also identify several media desks with the
three instances domains (xxx.xxx.xxx) by simply filtering values inside
instance1_unique on the instance2 column. This operation
allows to extract domain names themselves, thanks to the data
pattern.
(using instance1_unique$instance1 as domain name
identifier)
domain_three_instances <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
# filtering root domains (instance2) that contains matching values
# of instance1_unique$instance1
filter(instance2 %in% instance1_unique$instance1) %>%
filter(instance4 %in% NA) %>%
select(-instance4) %>%
# Taking out two instances domain that still remain
filter(instance3 %out% NA) %>%
distinct() %>%
arrange(instance2); domain_three_instancesBingo! We got our first filtered list of three-instance domains. Whilst Instance2 contains all the domain names (media names), we can also retrieve some insights about how various these media desks are (instance1). In Public Relations scope, varieties of online media desks (a.k.a subdomains) is one thing that is so unpredictable.
(using isntance2_unique$instance2 as root domain
identifier)
This may fetch domains that contains a single TLD
domain_three_instances_1 <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
# filtering root domains (instance2) that contains matching values
# of instance2_unique$instance2
filter(instance2 %in% instance2_unique$instance2) %>%
filter(instance4 %in% NA) %>%
select(-instance4) %>%
filter(instance3 %out% NA) %>%
distinct() %>%
# Negate our previous findings so that we can retrieve the new one
filter(instance2 %out% domain_three_instances$instance2) %>%
# Negate co, my, go as it supposed to be the SLD and TLD
filter(instance2 %out% c("co", "my", "go")) %>%
arrange(instance2)which contains both SLDs and TLDs (.my.id, .co.id, .com.my,
etc.).
In this case, domain names should be located in the
instance2 column.
I require this pattern to bind/aggregate all the results
domain_three_instances_2 <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
filter(instance2 %in% instance2_unique$instance2) %>%
filter(instance4 %in% NA) %>%
select(-instance4) %>%
filter(instance3 %out% NA) %>%
distinct() %>%
filter(instance2 %out% domain_three_instances$instance2) %>%
filter(instance2 %in% c("co", "my", "sg", "de", "com", "gov", "go")) %>%
# Temporary colnames
rename(two = instance1) %>%
rename(three = instance2) %>%
rename(four = instance3) %>%
# Normalized colnames (instance2 as root domain)
rename(instance2 = two) %>%
rename(instance3 = three) %>%
rename(instance4 = four) %>%
arrange(instance2) ; domain_three_instances_2This process can collect all domains that contain four instances (xxx.xxx.xxx.xxx) by simply removing all missing values throughout the dataframe
domain_four_instances <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
tidyr::drop_na(instance4) %>%
distinct() %>%
arrange(instance2)
domain_four_instancesthree_instances_final <- rbind(domain_three_instances, domain_three_instances_1)three_and_fourins <- bind_rows(domain_four_instances, three_instances_final)to_filter_twoins <- three_and_fourins %>%
select(instance2) %>%
distinct()domain_two_instances_com <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
filter(instance3 %in% NA) %>%
select(-instance3) %>%
filter(instance2 %in% "com") %>%
distinct() %>%
rename(two = instance1) %>%
rename(three = instance2) %>%
rename(instance2 = two) %>%
rename(instance3 = three)
domain_two_instances_1 <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
filter(instance3 %in% NA) %>%
select(-instance3) %>%
filter(instance2 %out% to_filter_twoins$instance2) %>%
select(-instance4) %>%
distinct() %>%
rename(two = instance1) %>%
rename(three = instance2) %>%
rename(instance2 = two) %>%
rename(instance3 = three)
two_instances <- bind_rows(domain_two_instances_com, domain_two_instances_1)all_instances <- bind_rows(three_and_fourins, domain_three_instances_1, domain_three_instances, domain_three_instances_2, domain_four_instances, two_instances)
all_instancesthree_and_fourins <- bind_rows(domain_four_instances, three_instances_final)all_instances <- bind_rows(three_and_fourins, domain_three_instances_1, domain_three_instances, domain_three_instances_2, domain_four_instances, two_instances)
with instance2 = domain name
all_instances <- all_instances %>%
filter(instance2 %out% "") %>%
mutate(final_domain = purrr::pmap_chr(., ~ c(...) %>%
na.omit %>%
paste(collapse = "."))) %>%
mutate(medianame = instance2) %>%
arrange(medianame)
unique_medianames <- all_instances %>%
distinct(); unique_medianamescleanurl3 <- gsub("http[s]?://","",url_list)
cleanurl4 <- gsub("www.","",cleanurl3)
final_domain <- sapply(strsplit(cleanurl4,"/"),"[",1)
final_domain <- as.data.frame(final_domain)
final_domain <- final_domain %>%
select(final_domain) %>%
rename(cleaned_domain = final_domain)
final_domain$NO <- 1:nrow(final_domain)
unique_medianames <- unique_medianames %>%
select(final_domain, medianame) %>%
rename(cleaned_domain = final_domain)to_viz <- merge(final_domain, unique_medianames, by = "cleaned_domain", all.x = FALSE)
to_viz %>% arrange(NO)We got 4,223 rows instead of 4,222. I suppose that this condition was
indicated bymedianame_root_domain column, which is probably
multiplied based on the identical value of cleaned_domain
column.
identify <- to_viz %>%
select(cleaned_domain, medianame) %>%
distinct()
the_duplicated_one <-
identify %>% select(cleaned_domain) %>% janitor::get_dupes() %>% distinct() ; the_duplicated_oneidentify %>%
filter(cleaned_domain %in% the_duplicated_one$cleaned_domain)Huft! that was probably a mistake in filtering. This should be in my
notes to be figured out later… but for now, this condition is not a
fatal one. We can still take out the unwanted values in
medianame_root_domain column by only negating it (in other
words, ".com" should not be identified as media name)
.comto_viz <- to_viz %>%
filter(medianame %out% "com")to_viz <- to_viz %>%
arrange(NO); to_vizDone! we got 4,222 entries which is similar to our data
input
Using a special function called as.treemapDF.
I got a lower-level R-codes from Stackoverflow (sorry I forgot the
documentation link), and am going to feature those codes on the appendix
as well
count_domains <- to_viz %>%
select(cleaned_domain, medianame) %>%
rename(`Media Outlet` = medianame) %>%
rename(`URL Simplified` = cleaned_domain) %>%
group_by(`Media Outlet`, `URL Simplified`) %>%
summarise("values" = n())
media_treemap <- as.treemapDF(count_domains, valueCol = "values") %>% as.data.frame()
new_parent <- media_treemap$parents %>%
tidyr::replace_na(., "")
media_treemap <- bind_cols(media_treemap, new_parent) %>%
rename(new_parent = ...5)library(plotly)
media_plotly_treemap <-
plot_ly(
data = media_treemap,
type = "treemap",
ids = ~ ids,
labels = ~ labels %>% stringr::str_wrap(width = 15),
parents = ~ new_parent,
values = ~ values,
domain = list(column = 2)
) %>% layout(title = list(text="Media Desk Spread: Foldable Phone Product Launch and Prelaunch", y = 0.98, x = 0.55, xanchor = 'center', yanchor = 'top')) %>%
layout(uniformtext = list(minsize = 10)) %>% layout(colorway = ~
c("#005E7C",
"#001242",
"#632e60",
"#162F20")) %>%
config(displayModeBar = FALSE)media_plotly_treemapThis is the plotly treemap visualization which was generated from our
data. We can certainly depict on how various the media names and media
desks here.
From the left to the right, we can see the media names sorted by the
number of coverage (and so with variety on their media desks). There are
Detik, Kurio, Teknosignal,
Kompas, Line (Line Today–The media agregator),
Siapgrak, Tribunnews,
Pikiran-rakyat, Grid, and many more.
Speaking about coverage, it is actually resembled by the area. What makes me into this treemap visualization, is the interactivity from Plotly that making it hoverable and clickable (also tidier compared to the other package). I hope that we can have fun, to explore the interactivity of this plot.
I interested to analyze several media names: they are
Detik, Kompas, Tribunnews, and
Antaranews; since those media names has various media desks
based on the treemap figured.
varieties_and_numbers <-
to_viz %>%
group_by(cleaned_domain, medianame) %>%
summarise("coverage" = n()) %>%
arrange(desc(coverage))detik <-
varieties_and_numbers %>%
filter(medianame %in% "detik")
detiksum(detik$coverage)#> [1] 145
The first media name we got here is Detik(.com). As
visualized on the treemap, Detik has eight different media
desks, namely inet.detik.com,
wolipop.detik.com, finance.detik.com,
travel.detik.com, hot.detik.com,
health.detik.com, sport.detik.com, and
20.detik.com. In terms of coverage, Detik
plays a significant article buzz, with 145 articles in total.
When it comes to the spread of media desks, Detik seems to put most
articles on inet.detik.com (Inet refers to “internet and
all tech related stuffs” – 70 Articles), followed by
wolipop.detik.com (wolipop resembles lifestyle contents –
20 articles). These media desk placements seems understandable, since
all the articles are speaking about smartphone Product Launch and
Prelaunch.
Detik’s coverage was followed by Kurio and
Teknosignal. But none of them have media desks. So, I am
heading to the next one, which is Kompas.54
varieties_and_numbers %>%
filter(medianame %in% "kompas")By coverage, kompas published 105 articles on this product launch and
prelaunch campaign. Their mediadesks seems not so various, as it
featured mostly on tekno.kompas.com; followed by
kompas.tv, beta.kompas.tv, and
indeks.kompas.com.
The next worth-to-analyze data point is Tribunnews. On
this kind of media, they prefer to specify desks based on region
(city/province name). However, on this case, they tend to put our
client’s articles on tribunnews.com; which is their main
domain.
varieties_and_numbers %>%
filter(medianame %in% "tribunnews")I found something interesting on this media. Since,
Pikiran-rakyat has so many media desks; with
mantrasukabumi.pikiran-rakyat.com on the biggest coverage
value (13 Articles). One thing to analyze further; did they spread the
articles based on domain performance? Or is there any other
preference?
varieties_and_numbers %>%
filter(medianame %in% "pikiran-rakyat")Grid also prefer to place our client’s articles on
nextren.grid.id (read: next trend), followed by
infokomputer.grid.id (read: the infos about
computer/gadgets). Their desk placements also seems reasonable.
varieties_and_numbers %>%
filter(medianame %in% "grid")Antaranews has the same pattern as
Tribunnews. In coverages and desk varieties, they were not
the biggest one
varieties_and_numbers %>%
filter(medianame %in% "antaranews")as.treemapDFas.treemapDF <- function(DF, valueCol = NULL){
require(data.table)
colNamesDF <- names(DF)
if(is.data.table(DF)){
DT <- copy(DF)
} else {
DT <- data.table(DF, stringsAsFactors = FALSE)
}
DT[, root := ""]
colNamesDT <- names(DT)
if(is.null(valueCol)){
setcolorder(DT, c("root", colNamesDF))
} else {
setnames(DT, valueCol, "values", skip_absent=TRUE)
setcolorder(DT, c("root", setdiff(colNamesDF, valueCol), "values"))
}
hierarchyCols <- setdiff(colNamesDT, "values")
hierarchyList <- list()
for(i in seq_along(hierarchyCols)){
currentCols <- colNamesDT[1:i]
if(is.null(valueCol)){
currentDT <- unique(DT[, ..currentCols][, values := .N, by = currentCols], by = currentCols)
} else {
currentDT <- DT[, lapply(.SD, sum, na.rm = TRUE), by=currentCols, .SDcols = "values"]
}
setnames(currentDT, length(currentCols), "labels")
hierarchyList[[i]] <- currentDT
}
hierarchyDT <- rbindlist(hierarchyList, use.names = TRUE, fill = TRUE)
parentCols <- setdiff(names(hierarchyDT), c("labels", "values", valueCol))
hierarchyDT[, parents := apply(.SD, 1, function(x){fifelse(all(is.na(x)), yes = NA_character_, no = paste(x[!is.na(x)], sep = ":", collapse = " - "))}), .SDcols = parentCols]
hierarchyDT[, ids := apply(.SD, 1, function(x){paste(x[!is.na(x)], collapse = " - ")}), .SDcols = c("parents", "labels")]
hierarchyDT[, c(parentCols) := NULL]
return(hierarchyDT)
}count_domainsmedia_treemapI am using values from media_treemap, but this
visualization was embedded from Flourish, a client-side graphing
provider (same as Plotly Express)